Megane Crenshaw
With the progress that has been made in technology and machine learning in the past decade, data science has become a very popular and in demand profession. This tutorial aims to give an introduction to data science by exploring and analyzing data from the FBI. The data we will be using comes from the Uniform Crime Reporting (UCR) program. In this program, law enforcement agencies across the nation provide data to the FBI, who then publishes it. For more information on this program you can visit https://www.fbi.gov/services/cjis/ucr. We will be using data from 2007 to 2017 to see if there is a relationship between number of officers and crime rate in order to hopefully inform policy and decision making in law enforcement.
For this tutorial you will be using the pandas library to read and organize the data, and matplotlib.pyplot to graph the data. For exploring and analyzing the data we will also be using SciKit-Learn. This tutorial assumes prior knowledge.
This is the first step in the data lifecycle. Here we just gather the data through methods like web scraping. Fortunately, the FBI provides Excel tables that we can download, so we don't need to scrape the web. To download these tables yourself, visit https://ucr.fbi.gov/crime-in-the-u.s/, click on the year you want data for, and then navigate to the violent crimes tables and police employee tables. We will be using table 4 for crime data and table 71 for officer data (the data is organized differently for 2016, so for 2016 we will be using tables 2 and 26). To read the tables use the pandas function read_excel().
import pandas
import re
import matplotlib.pyplot as plt
from sklearn import linear_model
import numpy as np
crime_2017 = pandas.read_excel("2017 Table 4.xls")
officer_2017 = pandas.read_excel("2017 Table 71.xls")
crime_2016 = pandas.read_excel("2016 Table 2.xls")
officer_2016 = pandas.read_excel("2016 Table 24.xls")
crime_2015 = pandas.read_excel("2015 Table 4.xls")
officer_2015 = pandas.read_excel("2015 Table 71.xls")
crime_2014 = pandas.read_excel("2014 Table 4.xls")
officer_2014 = pandas.read_excel("2014 Table 71.xls")
crime_2013 = pandas.read_excel("2013 Table 4.xls")
officer_2013 = pandas.read_excel("2013 Table 71.xls")
crime_2012 = pandas.read_excel("2012 Table 4.xls")
officer_2012 = pandas.read_excel("2012 Table 71.xls")
crime_2011 = pandas.read_excel("2011 Table 4.xls")
officer_2011 = pandas.read_excel("2011 Table 71.xls")
crime_2010 = pandas.read_excel("2010 Table 4.xls")
officer_2010 = pandas.read_excel("2010 Table 71.xls")
crime_2009 = pandas.read_excel("2009 Table 4.xls")
officer_2009 = pandas.read_excel("2009 Table 71.xls")
crime_2008 = pandas.read_excel("2008 Table 4.xls")
officer_2008 = pandas.read_excel("2008 Table 71.xls")
crime_2007 = pandas.read_excel("2007 Table 4.xls")
officer_2007 = pandas.read_excel("2007 Table 71.xls")
crime_2017.head(10)
So now we have all our data, but as you can see it is messy. In the data processing step of the lifecycle, we organize and "fix" the data so that it is readable and easy to use in the next steps, visualization and analysis. For this data, to "fix" it we will rename the columns, and change the index to be a MultiIndex. In addition, we will only be looking at data per region, so we can drop all the rows for individual states. For more information on what tidy data should look like, visit https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html.
The data for officers only has data per region, so we will remove the data for individual states from the crime tables since we will not be using it.
#remove empty columns
crime_2017_fixed = crime_2017.drop(["Unnamed: 21", "Unnamed: 22", "Unnamed: 23"], axis = 1)
#rename columns
crime_2017_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
#remove empty rows
crime_2017_fixed = crime_2017_fixed.drop([0,1,2,3,202,203,204,205,206,207,208,209])
# we are including DC and Puerto Rico because they are not regions are therefore not in the officer tables, so we
# will not be using them
states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
"Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
"Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
"Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
"Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
"North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
"Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
"Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming", "District of Columbia", "Puerto Rico"]
to_delete = []
#remove states from the data
for index, row in crime_2017_fixed.iterrows():
area_name = str(row["Area"]).strip()
area_name = re.sub(r'\d+', '', area_name)
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2017_fixed = crime_2017_fixed.drop(to_delete)
#change the index to a multi-index with 2 levels, one for region and one for year
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2016', '2017', 'Percent Change']],
names=['Area', 'Year'])
crime_2017_fixed.index = idx
#drop the area and year columns since they are now unnecessary
crime_2017_fixed = crime_2017_fixed.drop(["Area", "Year"], axis = 1)
crime_2017_fixed.head(10)
The data looks much better and much more readable. Now we do this for the rest of the tables.
crime_2016_fixed = crime_2016.drop(["Unnamed: 9", "Unnamed: 10"], axis = 1)
crime_2016_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
crime_2016_fixed = crime_2016_fixed.drop([0,1,2,3,202,203,204,205,206,207,208,209,210,211,212])
to_delete = []
for index, row in crime_2016_fixed.iterrows():
area_name = str(row["Area"]).strip()
area_name = re.sub(r'\d+', '', area_name)
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2016_fixed = crime_2016_fixed.drop(to_delete)
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2015', '2016', 'Percent Change']],
names=['Area', 'Year'])
crime_2016_fixed.index = idx
crime_2016_fixed = crime_2016_fixed.drop(["Area", "Year"], axis = 1)
crime_2015_fixed = crime_2015.drop(["Unnamed: 23", "Unnamed: 24", "Unnamed: 25", "Unnamed: 9", "Unnamed: 10"], axis = 1)
crime_2015_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
crime_2015_fixed = crime_2015_fixed.drop([0,1,2,3,202,203,204,205,206,207,208,209,210,211,212, 213])
to_delete = []
for index, row in crime_2015_fixed.iterrows():
area_name = str(row["Area"]).replace(",", "").strip()
area_name = re.sub(r'\d+', '', area_name).strip()
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2015_fixed = crime_2015_fixed.drop(to_delete)
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2014', '2015', 'Percent Change']],
names=['Area', 'Year'])
crime_2015_fixed.index = idx
crime_2015_fixed = crime_2015_fixed.drop(["Area", "Year"], axis = 1)
crime_2014_fixed = crime_2014.drop(["Unnamed: 23", "Unnamed: 24", "Unnamed: 25", "Unnamed: 9", "Unnamed: 10"], axis = 1)
crime_2014_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
crime_2014_fixed = crime_2014_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209,210,211,212, 213])
to_delete = []
for index, row in crime_2014_fixed.iterrows():
area_name = str(row["Area"]).replace(",", "").strip()
area_name = re.sub(r'\d+', '', area_name).strip()
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2014_fixed = crime_2014_fixed.drop(to_delete)
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2013', '2014', 'Percent Change']],
names=['Area', 'Year'])
crime_2014_fixed.index = idx
crime_2014_fixed = crime_2014_fixed.drop(["Area", "Year"], axis = 1)
crime_2013_fixed = crime_2013.drop(["Unnamed: 23", "Unnamed: 24", "Unnamed: 25", "Unnamed: 9", "Unnamed: 10"], axis = 1)
crime_2013_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
crime_2013_fixed = crime_2013_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209,210,211,212,213])
to_delete = []
for index, row in crime_2013_fixed.iterrows():
area_name = str(row["Area"]).replace(",", "").strip()
area_name = re.sub(r'\d+', '', area_name).strip()
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2013_fixed = crime_2013_fixed.drop(to_delete)
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2012', '2013', 'Percent Change']],
names=['Area', 'Year'])
crime_2013_fixed.index = idx
crime_2013_fixed = crime_2013_fixed.drop(["Area", "Year"], axis = 1)
crime_2012_fixed = crime_2012.drop(["Unnamed: 21"], axis = 1)
crime_2012_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
crime_2012_fixed = crime_2012_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209])
to_delete = []
for index, row in crime_2012_fixed.iterrows():
area_name = str(row["Area"]).replace(",", "").strip()
area_name = re.sub(r'\d+', '', area_name).strip()
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2012_fixed = crime_2012_fixed.drop(to_delete)
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2011', '2012', 'Percent Change']],
names=['Area', 'Year'])
crime_2012_fixed.index = idx
crime_2012_fixed = crime_2012_fixed.drop(["Area", "Year"], axis = 1)
crime_2011_fixed = crime_2011.drop(["Unnamed: 21", "Unnamed: 22"], axis = 1)
crime_2011_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
crime_2011_fixed = crime_2011_fixed.drop([0,1,2,3,202,203,204,205,206,207,208])
to_delete = []
for index, row in crime_2011_fixed.iterrows():
area_name = str(row["Area"]).replace(",", "").strip()
area_name = re.sub(r'\d+', '', area_name).strip()
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2011_fixed = crime_2011_fixed.drop(to_delete)
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2010', '2011', 'Percent Change']],
names=['Area', 'Year'])
crime_2011_fixed.index = idx
crime_2011_fixed = crime_2011_fixed.drop(["Area", "Year"], axis = 1)
crime_2010_fixed = crime_2010.drop(["Unnamed: 21", "Unnamed: 22", "Unnamed: 23"], axis = 1)
crime_2010_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
crime_2010_fixed = crime_2010_fixed.drop([0,1,2,3,199,200,201,202,203,204])
to_delete = []
for index, row in crime_2010_fixed.iterrows():
area_name = str(row["Area"]).replace(",", "").strip()
area_name = re.sub(r'\d+', '', area_name).strip()
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2010_fixed = crime_2010_fixed.drop(to_delete)
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2009', '2010', 'Percent Change']],
names=['Area', 'Year'])
crime_2010_fixed.index = idx
crime_2010_fixed = crime_2010_fixed.drop(["Area", "Year"], axis = 1)
crime_2009_fixed = crime_2009.drop(["Unnamed: 21", "Unnamed: 22", "Unnamed: 23"], axis = 1)
crime_2009_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
crime_2009_fixed = crime_2009_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209,210,211,212,213,214,215])
to_delete = []
for index, row in crime_2009_fixed.iterrows():
area_name = str(row["Area"]).replace(",", "").strip()
area_name = re.sub(r'\d+', '', area_name).strip()
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2009_fixed = crime_2009_fixed.drop(to_delete)
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2008', '2009', 'Percent Change']],
names=['Area', 'Year'])
crime_2009_fixed.index = idx
crime_2009_fixed = crime_2009_fixed.drop(["Area", "Year"], axis = 1)
crime_2008_fixed = crime_2008.drop(["Unnamed: 21", "Unnamed: 22", "Unnamed: 23"], axis = 1)
crime_2008_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
crime_2008_fixed = crime_2008_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209,210])
to_delete = []
for index, row in crime_2008_fixed.iterrows():
area_name = str(row["Area"]).replace(",", "").strip()
area_name = re.sub(r'\d+', '', area_name).strip()
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2008_fixed = crime_2008_fixed.drop(to_delete)
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2007', '2008', 'Percent Change']],
names=['Area', 'Year'])
crime_2008_fixed.index = idx
crime_2008_fixed = crime_2008_fixed.drop(["Area", "Year"], axis = 1)
crime_2007_fixed = crime_2007.drop(["Unnamed: 21", "Unnamed: 22", "Unnamed: 23"], axis = 1)
crime_2007_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000",
"Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition",
"Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
"Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft",
"Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]
crime_2007_fixed = crime_2007_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209,210,211,212,213,214,215,216])
to_delete = []
for index, row in crime_2007_fixed.iterrows():
area_name = str(row["Area"]).replace(",", "").strip()
area_name = re.sub(r'\d+', '', area_name).strip()
if area_name in states:
to_delete.extend([index,index+1, index+2])
crime_2007_fixed = crime_2007_fixed.drop(to_delete)
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['2006', '2007', 'Percent Change']],
names=['Area', 'Year'])
crime_2007_fixed.index = idx
crime_2007_fixed = crime_2007_fixed.drop(["Area", "Year"], axis = 1)
officer_2017_fixed = officer_2017.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
columns = ["Region", "Number of Officers", "Total", "Group 1", "Group 2", "Group 3", "Group 4", "Group 5",
"Group 6"]
officer_2017_fixed.columns = columns
officer_2017_fixed = officer_2017_fixed.drop([0,1,2,3,32,33,34,35])
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
'Midwest', 'East North Central', 'West North Central', 'South',
'South Atlantic', 'East South Central',
'West South Central', 'West', 'Mountain', 'Pacific'],
['Number of Officers', 'Average per 1,000']],
names=['Region', 'Officers'])
officer_2017_fixed.index = idx
officer_2017_fixed = officer_2017_fixed.drop(["Region", "Number of Officers"], axis=1)
officer_2017_fixed.head(10)
officer_2016_fixed = officer_2016.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2016_fixed.columns = columns
officer_2016_fixed = officer_2016_fixed.drop([0,1,2,3,32,33])
officer_2016_fixed.index = idx
officer_2016_fixed = officer_2016_fixed.drop(["Region", "Number of Officers"], axis=1)
officer_2015_fixed = officer_2015.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2015_fixed.columns = columns
officer_2015_fixed = officer_2015_fixed.drop([0,1,2,3,32,33])
officer_2015_fixed.index = idx
officer_2015_fixed = officer_2015_fixed.drop(["Region", "Number of Officers"], axis=1)
officer_2014_fixed = officer_2014.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2014_fixed.columns = columns
officer_2014_fixed = officer_2014_fixed.drop([0,1,2,3,32,33])
officer_2014_fixed.index = idx
officer_2014_fixed = officer_2014_fixed.drop(["Region", "Number of Officers"], axis=1)
officer_2013_fixed = officer_2013.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2013_fixed.columns = columns
officer_2013_fixed = officer_2013_fixed.drop([0,1,2,3,32,33,34,35,36,37])
officer_2013_fixed.index = idx
officer_2013_fixed = officer_2013_fixed.drop(["Region", "Number of Officers"], axis=1)
officer_2012_fixed = officer_2012.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2012_fixed.columns = columns
officer_2012_fixed = officer_2012_fixed.drop([0,1,2,3,32,33,34,35,36,37])
officer_2012_fixed.index = idx
officer_2012_fixed = officer_2012_fixed.drop(["Region", "Number of Officers"], axis=1)
officer_2011_fixed = officer_2011.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2011_fixed.columns = columns
officer_2011_fixed = officer_2011_fixed.drop([0,1,2,3,32,33,34,35,36,37])
officer_2011_fixed.index = idx
officer_2011_fixed = officer_2011_fixed.drop(["Region", "Number of Officers"], axis=1)
officer_2010_fixed = officer_2010.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2010_fixed.columns = columns
officer_2010_fixed = officer_2010_fixed.drop([0,1,2,3,32,33,34,35,36,37])
officer_2010_fixed.index = idx
officer_2010_fixed = officer_2010_fixed.drop(["Region", "Number of Officers"], axis=1)
officer_2009_fixed = officer_2009.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2009_fixed.columns = columns
officer_2009_fixed = officer_2009_fixed.drop([0,1,2,3,32,33,34,35,36,37])
officer_2009_fixed.index = idx
officer_2009_fixed = officer_2009_fixed.drop(["Region", "Number of Officers"], axis=1)
officer_2008_fixed = officer_2008.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2008_fixed.columns = columns
officer_2008_fixed = officer_2008_fixed.drop([0,1,2,3,32,33,34,35,36,37,38])
officer_2008_fixed.index = idx
officer_2008_fixed = officer_2008_fixed.drop(["Region", "Number of Officers"], axis=1)
officer_2007_fixed = officer_2009.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2007_fixed.columns = columns
officer_2007_fixed = officer_2007_fixed.drop([0,1,2,3,32,33,34,35,36,37])
officer_2007_fixed.index = idx
officer_2007_fixed = officer_2007_fixed.drop(["Region", "Number of Officers"], axis=1)
Our data is much cleaner now; however, we are not done. In the visualization step we want to plot things over time, so it would be helpful to reorganize the tables to be per region instead of per year to make plotting easier.
Merging two tables uses the index to line up the rows of the two tables, so we change the index on both the crime and officer tables to make sure the tables merge properly.
#gather the data for total crime in the US for each year
total_crime = pandas.DataFrame([crime_2017_fixed.iloc[1], crime_2016_fixed.iloc[1], crime_2015_fixed.iloc[1],
crime_2014_fixed.iloc[1], crime_2013_fixed.iloc[1], crime_2012_fixed.iloc[1],
crime_2011_fixed.iloc[1], crime_2010_fixed.iloc[1], crime_2009_fixed.iloc[1],
crime_2008_fixed.iloc[1], crime_2007_fixed.iloc[1],
])
#add a column for year and change the index back to default
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
total_crime.insert(loc=0, column="Year", value=years)
total_crime.index=[0,1,2,3,4,5,6,7,8,9,10]
#gather data for officers in the US for each year
total_officers = pandas.DataFrame([officer_2017_fixed.iloc[1], officer_2016_fixed.iloc[1], officer_2015_fixed.iloc[1],
officer_2014_fixed.iloc[1], officer_2013_fixed.iloc[1], officer_2012_fixed.iloc[1],
officer_2011_fixed.iloc[1], officer_2010_fixed.iloc[1], officer_2009_fixed.iloc[1],
officer_2008_fixed.iloc[1], officer_2007_fixed.iloc[1]])
#change the index to default and merge the crime and officer tables
total_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
total = pandas.concat([total_crime, total_officers], sort=False, axis=1)
total.name = "total"
total
Now we do this for the rest of the regions.
northeast = pandas.DataFrame([crime_2017_fixed.iloc[4], crime_2016_fixed.iloc[4], crime_2015_fixed.iloc[4],
crime_2014_fixed.iloc[4], crime_2013_fixed.iloc[4], crime_2012_fixed.iloc[4],
crime_2011_fixed.iloc[4], crime_2010_fixed.iloc[4], crime_2009_fixed.iloc[4],
crime_2008_fixed.iloc[4], crime_2007_fixed.iloc[4]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
northeast.insert(loc=0, column="Year", value=years)
northeast.index=[0,1,2,3,4,5,6,7,8,9,10]
northeast_officers = pandas.DataFrame([officer_2017_fixed.iloc[3], officer_2016_fixed.iloc[3], officer_2015_fixed.iloc[3],
officer_2014_fixed.iloc[3], officer_2013_fixed.iloc[3], officer_2012_fixed.iloc[3],
officer_2011_fixed.iloc[3], officer_2010_fixed.iloc[3], officer_2009_fixed.iloc[3],
officer_2008_fixed.iloc[3], officer_2007_fixed.iloc[3]])
northeast_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
northeast = pandas.concat([northeast, northeast_officers], sort=False, axis=1)
northeast.name = "Northeast"
new_england = pandas.DataFrame([crime_2017_fixed.iloc[7], crime_2016_fixed.iloc[7], crime_2015_fixed.iloc[7],
crime_2014_fixed.iloc[7], crime_2013_fixed.iloc[7], crime_2012_fixed.iloc[7],
crime_2011_fixed.iloc[7], crime_2010_fixed.iloc[7], crime_2009_fixed.iloc[7],
crime_2008_fixed.iloc[7], crime_2007_fixed.iloc[7]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
new_england.insert(loc=0, column="Year", value=years)
new_england.index=[0,1,2,3,4,5,6,7,8,9,10]
new_england_officers = pandas.DataFrame([officer_2017_fixed.iloc[5], officer_2016_fixed.iloc[5], officer_2015_fixed.iloc[5],
officer_2014_fixed.iloc[5], officer_2013_fixed.iloc[5], officer_2012_fixed.iloc[5],
officer_2011_fixed.iloc[5], officer_2010_fixed.iloc[5], officer_2009_fixed.iloc[5],
officer_2008_fixed.iloc[5], officer_2007_fixed.iloc[5]])
new_england_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
new_england = pandas.concat([new_england, new_england_officers], sort=False, axis=1)
new_england.name = "New England"
middle_atlantic = pandas.DataFrame([crime_2017_fixed.iloc[10], crime_2016_fixed.iloc[10], crime_2015_fixed.iloc[10],
crime_2014_fixed.iloc[10], crime_2013_fixed.iloc[10], crime_2012_fixed.iloc[10],
crime_2011_fixed.iloc[10], crime_2010_fixed.iloc[10], crime_2009_fixed.iloc[10],
crime_2008_fixed.iloc[10], crime_2007_fixed.iloc[10]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
middle_atlantic.insert(loc=0, column="Year", value=years)
middle_atlantic.index=[0,1,2,3,4,5,6,7,8,9,10]
middle_atlantic_officers = pandas.DataFrame([officer_2017_fixed.iloc[7], officer_2016_fixed.iloc[7], officer_2015_fixed.iloc[7],
officer_2014_fixed.iloc[7], officer_2013_fixed.iloc[7], officer_2012_fixed.iloc[7],
officer_2011_fixed.iloc[7], officer_2010_fixed.iloc[7], officer_2009_fixed.iloc[7],
officer_2008_fixed.iloc[7], officer_2007_fixed.iloc[7]])
middle_atlantic_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
middle_atlantic = pandas.concat([middle_atlantic, middle_atlantic_officers], sort=False, axis=1)
middle_atlantic.name = "Middle Atlantic"
midwest = pandas.DataFrame([crime_2017_fixed.iloc[13], crime_2016_fixed.iloc[13], crime_2015_fixed.iloc[13],
crime_2014_fixed.iloc[13], crime_2013_fixed.iloc[13], crime_2012_fixed.iloc[13],
crime_2011_fixed.iloc[13], crime_2010_fixed.iloc[13], crime_2009_fixed.iloc[13],
crime_2008_fixed.iloc[13], crime_2007_fixed.iloc[13]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
midwest.insert(loc=0, column="Year", value=years)
midwest.index=[0,1,2,3,4,5,6,7,8,9,10]
midwest_officers = pandas.DataFrame([officer_2017_fixed.iloc[9], officer_2016_fixed.iloc[9], officer_2015_fixed.iloc[9],
officer_2014_fixed.iloc[9], officer_2013_fixed.iloc[9], officer_2012_fixed.iloc[9],
officer_2011_fixed.iloc[9], officer_2010_fixed.iloc[9], officer_2009_fixed.iloc[9],
officer_2008_fixed.iloc[9], officer_2007_fixed.iloc[9]])
midwest_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
midwest = pandas.concat([midwest, midwest_officers], sort=False, axis=1)
midwest.name = "Midwest"
east_north_central = pandas.DataFrame([crime_2017_fixed.iloc[16], crime_2016_fixed.iloc[16], crime_2015_fixed.iloc[16],
crime_2014_fixed.iloc[16], crime_2013_fixed.iloc[16], crime_2012_fixed.iloc[16],
crime_2011_fixed.iloc[16], crime_2010_fixed.iloc[16], crime_2009_fixed.iloc[16],
crime_2008_fixed.iloc[16], crime_2007_fixed.iloc[16]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
east_north_central.insert(loc=0, column="Year", value=years)
east_north_central.index=[0,1,2,3,4,5,6,7,8,9,10]
east_north_central_officers = pandas.DataFrame([officer_2017_fixed.iloc[11], officer_2016_fixed.iloc[11], officer_2015_fixed.iloc[11],
officer_2014_fixed.iloc[11], officer_2013_fixed.iloc[11], officer_2012_fixed.iloc[11],
officer_2011_fixed.iloc[11], officer_2010_fixed.iloc[11], officer_2009_fixed.iloc[11],
officer_2008_fixed.iloc[11], officer_2007_fixed.iloc[11]])
east_north_central_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
east_north_central = pandas.concat([east_north_central, east_north_central_officers], sort=False, axis=1)
east_north_central.name = "East North Central"
west_north_central = pandas.DataFrame([crime_2017_fixed.iloc[19], crime_2016_fixed.iloc[19], crime_2015_fixed.iloc[19],
crime_2014_fixed.iloc[19], crime_2013_fixed.iloc[19], crime_2012_fixed.iloc[19],
crime_2011_fixed.iloc[19], crime_2010_fixed.iloc[19], crime_2009_fixed.iloc[19],
crime_2008_fixed.iloc[19], crime_2007_fixed.iloc[19]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
west_north_central.insert(loc=0, column="Year", value=years)
west_north_central.index=[0,1,2,3,4,5,6,7,8,9,10]
west_north_central_officers = pandas.DataFrame([officer_2017_fixed.iloc[13], officer_2016_fixed.iloc[13], officer_2015_fixed.iloc[13],
officer_2014_fixed.iloc[13], officer_2013_fixed.iloc[13], officer_2012_fixed.iloc[13],
officer_2011_fixed.iloc[13], officer_2010_fixed.iloc[13], officer_2009_fixed.iloc[13],
officer_2008_fixed.iloc[13], officer_2007_fixed.iloc[13]])
west_north_central_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
west_north_central = pandas.concat([west_north_central, west_north_central_officers], sort=False, axis=1)
west_north_central.name = "West North Central"
south = pandas.DataFrame([crime_2017_fixed.iloc[22], crime_2016_fixed.iloc[22], crime_2015_fixed.iloc[22],
crime_2014_fixed.iloc[22], crime_2013_fixed.iloc[22], crime_2012_fixed.iloc[22],
crime_2011_fixed.iloc[22], crime_2010_fixed.iloc[22], crime_2009_fixed.iloc[22],
crime_2008_fixed.iloc[22], crime_2007_fixed.iloc[22]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
south.insert(loc=0, column="Year", value=years)
south.index=[0,1,2,3,4,5,6,7,8,9,10]
south_officers = pandas.DataFrame([officer_2017_fixed.iloc[15], officer_2016_fixed.iloc[15], officer_2015_fixed.iloc[15],
officer_2014_fixed.iloc[15], officer_2013_fixed.iloc[15], officer_2012_fixed.iloc[15],
officer_2011_fixed.iloc[15], officer_2010_fixed.iloc[15], officer_2009_fixed.iloc[15],
officer_2008_fixed.iloc[15], officer_2007_fixed.iloc[15]])
south_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
south = pandas.concat([south, south_officers], sort=False, axis=1)
south.name = "South"
south_atlantic = pandas.DataFrame([crime_2017_fixed.iloc[25], crime_2016_fixed.iloc[25], crime_2015_fixed.iloc[25],
crime_2014_fixed.iloc[25], crime_2013_fixed.iloc[25], crime_2012_fixed.iloc[25],
crime_2011_fixed.iloc[25], crime_2010_fixed.iloc[25], crime_2009_fixed.iloc[25],
crime_2008_fixed.iloc[25], crime_2007_fixed.iloc[25]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
south_atlantic.insert(loc=0, column="Year", value=years)
south_atlantic.index=[0,1,2,3,4,5,6,7,8,9,10]
south_atlantic_officers = pandas.DataFrame([officer_2017_fixed.iloc[17], officer_2016_fixed.iloc[17], officer_2015_fixed.iloc[17],
officer_2014_fixed.iloc[17], officer_2013_fixed.iloc[17], officer_2012_fixed.iloc[17],
officer_2011_fixed.iloc[17], officer_2010_fixed.iloc[17], officer_2009_fixed.iloc[17],
officer_2008_fixed.iloc[17], officer_2007_fixed.iloc[17]])
south_atlantic_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
south_atlantic = pandas.concat([south_atlantic, south_atlantic_officers], sort=False, axis=1)
south_atlantic.name = "South Atlantic"
east_south_central = pandas.DataFrame([crime_2017_fixed.iloc[28], crime_2016_fixed.iloc[28], crime_2015_fixed.iloc[28],
crime_2014_fixed.iloc[28], crime_2013_fixed.iloc[28], crime_2012_fixed.iloc[28],
crime_2011_fixed.iloc[28], crime_2010_fixed.iloc[28], crime_2009_fixed.iloc[28],
crime_2008_fixed.iloc[28], crime_2007_fixed.iloc[28]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
east_south_central.insert(loc=0, column="Year", value=years)
east_south_central.index=[0,1,2,3,4,5,6,7,8,9,10]
east_south_central_officers = pandas.DataFrame([officer_2017_fixed.iloc[19], officer_2016_fixed.iloc[19], officer_2015_fixed.iloc[19],
officer_2014_fixed.iloc[19], officer_2013_fixed.iloc[19], officer_2012_fixed.iloc[19],
officer_2011_fixed.iloc[19], officer_2010_fixed.iloc[19], officer_2009_fixed.iloc[19],
officer_2008_fixed.iloc[19], officer_2007_fixed.iloc[19]])
east_south_central_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
east_south_central = pandas.concat([east_south_central, east_south_central_officers], sort=False, axis=1)
east_south_central.name = "East South Central"
west_south_central = pandas.DataFrame([crime_2017_fixed.iloc[31], crime_2016_fixed.iloc[31], crime_2015_fixed.iloc[31],
crime_2014_fixed.iloc[31], crime_2013_fixed.iloc[31], crime_2012_fixed.iloc[31],
crime_2011_fixed.iloc[31], crime_2010_fixed.iloc[31], crime_2009_fixed.iloc[31],
crime_2008_fixed.iloc[31], crime_2007_fixed.iloc[31]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
west_south_central.insert(loc=0, column="Year", value=years)
west_south_central.index=[0,1,2,3,4,5,6,7,8,9,10]
west_south_central_officers = pandas.DataFrame([officer_2017_fixed.iloc[21], officer_2016_fixed.iloc[21], officer_2015_fixed.iloc[21],
officer_2014_fixed.iloc[21], officer_2013_fixed.iloc[21], officer_2012_fixed.iloc[21],
officer_2011_fixed.iloc[21], officer_2010_fixed.iloc[21], officer_2009_fixed.iloc[21],
officer_2008_fixed.iloc[21], officer_2007_fixed.iloc[21]])
west_south_central_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
west_south_central = pandas.concat([west_south_central, west_south_central_officers], sort=False, axis=1)
west_south_central.name = "West South Central"
west = pandas.DataFrame([crime_2017_fixed.iloc[34], crime_2016_fixed.iloc[34], crime_2015_fixed.iloc[34],
crime_2014_fixed.iloc[34], crime_2013_fixed.iloc[34], crime_2012_fixed.iloc[34],
crime_2011_fixed.iloc[34], crime_2010_fixed.iloc[34], crime_2009_fixed.iloc[34],
crime_2008_fixed.iloc[34], crime_2007_fixed.iloc[34]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
west.insert(loc=0, column="Year", value=years)
west.index=[0,1,2,3,4,5,6,7,8,9,10]
west_officers = pandas.DataFrame([officer_2017_fixed.iloc[23], officer_2016_fixed.iloc[23], officer_2015_fixed.iloc[23],
officer_2014_fixed.iloc[23], officer_2013_fixed.iloc[23], officer_2012_fixed.iloc[23],
officer_2011_fixed.iloc[23], officer_2010_fixed.iloc[23], officer_2009_fixed.iloc[23],
officer_2008_fixed.iloc[23], officer_2007_fixed.iloc[23]])
west_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
west = pandas.concat([west, west_officers], sort=False, axis=1)
west.name = "West"
mountain = pandas.DataFrame([crime_2017_fixed.iloc[37], crime_2016_fixed.iloc[37], crime_2015_fixed.iloc[37],
crime_2014_fixed.iloc[37], crime_2013_fixed.iloc[37], crime_2012_fixed.iloc[37],
crime_2011_fixed.iloc[37], crime_2010_fixed.iloc[37], crime_2009_fixed.iloc[37],
crime_2008_fixed.iloc[37], crime_2007_fixed.iloc[37]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
mountain.insert(loc=0, column="Year", value=years)
mountain.index=[0,1,2,3,4,5,6,7,8,9,10]
mountain_officers = pandas.DataFrame([officer_2017_fixed.iloc[25], officer_2016_fixed.iloc[25], officer_2015_fixed.iloc[25],
officer_2014_fixed.iloc[25], officer_2013_fixed.iloc[25], officer_2012_fixed.iloc[25],
officer_2011_fixed.iloc[25], officer_2010_fixed.iloc[25], officer_2009_fixed.iloc[25],
officer_2008_fixed.iloc[25], officer_2007_fixed.iloc[25]])
mountain_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
mountain = pandas.concat([mountain, mountain_officers], sort=False, axis=1)
mountain.name = "Mountain"
pacific = pandas.DataFrame([crime_2017_fixed.iloc[40], crime_2016_fixed.iloc[40], crime_2015_fixed.iloc[40],
crime_2014_fixed.iloc[40], crime_2013_fixed.iloc[40], crime_2012_fixed.iloc[40],
crime_2011_fixed.iloc[40], crime_2010_fixed.iloc[40], crime_2009_fixed.iloc[40],
crime_2008_fixed.iloc[40], crime_2007_fixed.iloc[40]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
pacific.insert(loc=0, column="Year", value=years)
pacific.index=[0,1,2,3,4,5,6,7,8,9,10]
pacific_officers = pandas.DataFrame([officer_2017_fixed.iloc[27], officer_2016_fixed.iloc[27], officer_2015_fixed.iloc[27],
officer_2014_fixed.iloc[27], officer_2013_fixed.iloc[27], officer_2012_fixed.iloc[27],
officer_2011_fixed.iloc[27], officer_2010_fixed.iloc[27], officer_2009_fixed.iloc[27],
officer_2008_fixed.iloc[27], officer_2007_fixed.iloc[27]])
pacific_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
pacific = pandas.concat([pacific, pacific_officers], sort=False, axis=1)
pacific.name = "Pacific"
In this stage of the life cycle we plot the data to observe potential trends and see what would be useful to analyze. Let's start by plotting the number of violent crimes over time for each region.
plt.rcParams['figure.figsize'] = [10,10]
regions = [total, northeast, new_england, middle_atlantic, midwest, east_north_central, west_north_central,
south, south_atlantic, east_south_central, west_south_central, west, mountain, pacific]
for region in regions:
plt.plot(region["Year"], region["Violent Crime"], label=region.name)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.gca().invert_xaxis()
plt.title("Violent Crime over Time")
plt.show()
plt.close()
After plotting the data we can see that over time, crime is decreasing. However, different regions have different populations, so it may be more useful to look at crime rate over time instead of the raw crime numbers.
for region in regions:
plt.plot(region["Year"], region["Violent Crime Rate per 100,000"], label=region.name)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#print(region["Year"])
plt.gca().invert_xaxis()
plt.show()
plt.close()
As we can see, looking at crime rate is more informative compared to looking at the raw numbers. Crime rate is still decreasing over time, but now we have a better visual for how things are changing over time.
Now let's plot the rates for different types of crime over time for each region.
crimes = ["Violent Crime Rate per 100,000", "Muderer Rate", "Rape (Revised) Rate", "Robbery Rate",
"Aggravated Assault Rate", "Property Crime Rate", "Burglary Rate", "Larceny-Theft Rate",
"Motor Vehicle Theft Rate"]
for i in range(14):
for crime in crimes:
#axes[i].
plt.plot(regions[i]["Year"], regions[i][crime],label=crime)
plt.title(regions[i].name)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.gca().invert_xaxis()
plt.show()
plt.close()
plt.show()
plt.close()
As we can see from these plots, for all the regions the most common crime is property crime. For most regions crime is either decreasing or staying constant. However, in some regions, like the midwest and east north central, the violent crime rate is starting to increase. In addition, in the mountain and pacific regions the motor vehicle theft rate is also starting to increase.
It may also be useful to plot by crime instead of by region.
for crime in crimes:
for region in regions:
plt.plot(region["Year"], region[crime], label=region.name)
plt.title(crime)
plt.gca().invert_xaxis()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
plt.close()
plt.show()
plt.close()
These plots confirm what we had said previously; in some regions the rate of violent crime is increasing. These plots also show that where we were wrong when we previously though that crime was decreasing or staying constant. Some crimes, such as robbery, property crime, burglary, and larceny-theft are decreasing. However, some of the others that looked constant in the region plots are actually increasing, such as rape and murder. In addition, violent crime, aggravated assault, and motor vehicle theft were decreasing over time, but in the past few years have started to increase.
Next let's look at the rate of number of officers in a region over time, to see if there has been an increase in the number of officers.
for region in regions:
plt.plot(region["Year"], region["Total"], label=region.name)
plt.gca().invert_xaxis()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
plt.close()
From this plot we can see that overall the number of officers is decreasing slightly over time. In addition, some regions have a lot more officers per 1,000 inhabitants than others. For example, in 2007 the South Atlantic had 3 officers per 1,000 whereas the Pacific only had 1.7.
In this step of the data cycle we apply machine learning and statistics in order to uncover trends and relationships in the data. We will be using linear regression to obtain a predictive model of the data, which can be used to extrapolate.
After plotting the data, we want to see if there is a relationship between crime rate and the number of officers. To do this we will plot the number of officers per 1,000 inhabitants vs the crime rate for each crime. Then we will use linear regression to find a line of best fit for the data.
for crime in crimes:
x = []
y = []
for region in regions:
x.append(region["Total"].values)
y.append(region[crime].values)
x = np.array(x)
x = x.flatten()
y = np.array(y)
y = y.flatten()
plt.scatter(x, y)
plt.title(crime)
line = linear_model.LinearRegression()
line.fit(x.reshape(-1,1),y)
predicted = line.predict(x.reshape(-1,1))
r_value = line.score(x.reshape(-1,1),y)
plt.plot(x,predicted, c='r', label=r_value)
plt.legend()
plt.show()
plt.close()
The r-values for each crime is less than .5, so there is not a strong linear relationship between number of officers and crime rate. However, the data is very scattered so there may be a relationship if we look at the data per crime and per region. Let's plot each region and crime individually to see if this is the case.
for region in regions:
for crime in crimes:
plt.scatter(region["Total"], region[crime], label=crime)
plt.title(region.name + " " + crime)
line = linear_model.LinearRegression()
line.fit(region["Total"].values.reshape(-1,1),region[crime])
predicted = line.predict(region["Total"].values.reshape(-1,1))
r_value = line.score(region["Total"].values.reshape(-1,1),region[crime])
plt.plot(region["Total"],predicted, c='r', label=r_value)
plt.legend()
plt.show()
plt.close()
From these plots, we can see that some pairs of region and crime rate do have a stronger linear relationship. For example, in the west north central region, the r-value for rape is -.87, which indicates a strong linear relationship. Another example is in the same region the r-value for property crime is .72, for burglary it is .83, and for larceny-theft it is .72. In the South the r-value for rape is -.64 and the r-value for property crime is .63, which indicates a moderate linear relationship. In addition, in the South the r-value for burglary is .72. In the south atlantic the r-value for rape is -.61, for robbery it is .58, for property crime it is .85, burglary is .94, larceny-theft is .84. In the West the r-value for robbery is .74 and in the mountain region the r-value for robbery rate is .67. In the pacific region the r-value for violent crime is .61, the r-value for murder is .55, for robbery it's .84, and for property crime it is .55. For the rest of the regions and crimes, there is no linear relationship.
From this we can see that the number of officers does not uniformly effect crime per region, or region per crime. That is, whether there is a correlation between number of officers and crime rate depends on both the region and the crime, not just region and not just the crime. Furthermore, in most cases there is a positive correlation between number of officers and crime rate! For example, in the pacific region, aggravated assault has a positive correlation with the number of officers. The only crime that has a negative correlation is rape.
In the final step of the data cycle, we draw conclusions based off our analysis to inform decisions made based on the data.
In this case, we conclude that there is no relationship between number of officers and crime rate. When analyzing the data by crime and region, most regions and crimes had no linear relationship. Of the few that did, there was actually a positive correlation, indicating that with more officers there was more crime. However, not all crimes with a correlation had a positive one, since rape, when it was correlated, had a negative correlation.
Based on this further analysis would need to be done before making any decisions. Even if there is some positive correlation between number of officers and crime rate, cities should not reduce the number of officers. There is more to the relationship between number of officers and crime rate than is apparent, and further analysis could shed light on this. Some things to look into are gender ratio of officers, crime rates per city instead of per region, and officer salaries. In addition, more advanced machine learning techniques could be applied to the data, such as Gaussian Mixture Models. To learn more about machine learning visit https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12.